Adaptive load balancing in a multi-processor graphics processing system

ABSTRACT

Systems and methods for balancing a load among multiple graphics processors that render different portions of a frame. A display area is partitioned into portions for each of two (or more) graphics processors. The graphics processors render their respective portions of a frame and return feedback data indicating completion of the rendering. Based on the feedback data, an imbalance can be detected between respective loads of two of the graphics processors. In the event that an imbalance exists, the display area is re-partitioned to increase a size of the portion assigned to the less heavily loaded processor and to decrease a size of the portion assigned to the more heavily loaded processor.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 10/642,905, filed Aug. 18, 2003, which disclosure is incorporated herein by reference for all purposes.

The present disclosure is related to the following commonly assigned copending U.S. Patent Applications: application Ser. No. 10/643,072, filed on the same date as the present application, entitled “Private Addressing in a Multi Processor Graphics Processing System” and application Ser. No. 10/639,893, filed Aug. 12, 2003, entitled “Programming Multiple Chips from a Command Buffer,” the respective disclosures of which are incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

The present invention relates generally to graphics processing subsystems with multiple processors and in particular to adaptive load balancing for such graphics processing subsystems.

Graphics processing subsystems are designed to render realistic animated images in real time, e.g., at 30 or more frames per second. These subsystems are most often implemented on expansion cards that can be inserted into appropriately configured slots on a motherboard of a computer system and generally include one or more dedicated graphics processing units (GPUs) and dedicated graphics memory. The typical GPU is a highly complex integrated circuit device optimized to perform graphics computations (e.g., matrix transformations, scan-conversion and/or other rasterization techniques, texture blending, etc.) and write the results to the graphics memory. The GPU is a “slave” processor that operates in response to commands received from a driver program executing on a “master” processor, generally the central processing unit (CPU) of the system.

To meet the demands for realism and speed, some GPUs include more transistors than typical CPUs. In addition, graphics memories have become quite large in order to improve speed by reducing traffic on the system bus; some graphics cards now include as much as 256 MB of memory. But despite these advances, a demand for even greater realism and faster rendering persists.

As one approach to meeting this demand, some manufacturers have begun to develop “multi-chip” graphics processing subsystems in which two or more GPUs, usually on the same card, operate in parallel. Parallel operation substantially increases the number of rendering operations that can be carried out per second without requiring significant advances in GPU design. To minimize resource conflicts between the GPUs, each GPU is generally provided with its own dedicated memory area, including a display buffer to which the GPU writes pixel data it renders.

In a multi-chip system, the processing burden may be divided among the GPUs in various ways. For example, each GPU may be instructed to render pixel data for a different portion of the displayable image, such as a number of lines of a raster-based display. The image is displayed by reading out the pixel data from each GPU's display buffer in an appropriate sequence. As a more concrete example, a graphics processing subsystem may use two GPUs to generate a displayable image consisting of M rows of pixel data; the first GPU can be instructed to render rows 1 through P, while the second GPU is instructed to render rows P+1 through M. To preserve internal consistency of the displayed image (“frame coherence”), each GPU is prevented from rendering a subsequent frame until the other GPU has also finished the current frame so that both portions of the displayed image are updated in the same scanout pass.

Ideally, the display area (or screen) is partitioned in such a way that each GPU requires an equal amount of time to render its portion of the image. If the rendering times are unequal, a GPU that finishes its portion of the frame first will be idle, wasting valuable computational resources. In general, simply partitioning the display area equally among the GPUs is not an optimal solution because the rendering complexity of different parts of an image can vary widely. For example, in a typical scene from a video game, the foreground characters and/or vehicles, which are often complex objects rendered from a large number of primitives, tend to appear near the bottom of the image, while the top portion of the image is often occupied by a relatively static background that can be rendered from relatively few primitives and texture maps. When such an image is split into top and bottom halves, the GPU that renders the top half will generally complete its portion of the image, then wait for the other GPU to finish. To avoid this idle time, it would be desirable to divide the display area unequally, with the top portion being larger than the bottom portion. In general, the optimal division depends on the particular scene being rendered and may vary over time even within a single video game or other graphics application.

It would, therefore, be desirable to provide a mechanism whereby the processing load on each GPU can be monitored and the division of the display area among the GPUs can be dynamically adjusted to balance the loads.

BRIEF SUMMARY OF THE INVENTION

The present invention provides systems and methods for balancing a load among multiple graphics processors that render different portions of a frame.

According to one aspect of the invention, a method is provided for load balancing for graphics processors configured to operate in parallel. A display area is partitioned into at least a first portion to be rendered by a first one of the graphics processors and a second portion to be rendered by a second one of the graphics processors. The graphics processors are instructed to render a frame, wherein the first and second graphics processors perform rendering for the first and second portions of the display area, respectively. Feedback data for the frame is received from the first and second graphics processors, the feedback data reflecting respective rendering times for the first and second graphics processors. Based on the feedback data, it is determined whether an imbalance exists between respective loads of the first and second graphics processors. In the event that an imbalance exists, based on the feedback data, the one of the first and second graphics processors that is more heavily loaded is identified; the display area is re-partitioned to decrease a size of the one of the first and second portions of the display area that is rendered by the more heavily loaded one of the first and second graphics processors and to increase a size of the other of the first and second portions of the display area.

According to another aspect of the invention, a method is provided for load balancing for graphics processors configured to operate in parallel. A display area is partitioned into at least a first portion to be rendered by a first graphics processor and a second portion to be rendered by a second graphics processor. The graphics processors are instructed to render a number of frames, wherein the first and second graphics processors perform rendering for the first and second portions of the display area, respectively. Feedback data for each of the frames is received from the first and second graphics processors, the feedback data for each frame indicating which of the first and second graphics processors was last to finish rendering the frame. Based on the feedback data, it is determined whether an imbalance exists between respective loads of the first and second graphics processors. In the event that an imbalance exists, based on the feedback data, the one of the first and second graphics processors that is more heavily loaded is identified; the display area is re-partitioned to decrease a size of the one of the first and second portions of the display area that is rendered by the more heavily loaded one of the first and second graphics processors and to increase a size of the other of the first and second portions of the display area.

In some embodiments, a storage location is associated with each one of the frames, and receiving the feedback data for each of the frames includes instructing the first graphics processor to store a first processor identifier in the associated one of the storage locations for each of the frames after rendering the first portion of the display area for that frame; and instructing the second graphics processor to store a second processor identifier different from the first processor identifier in the associated one of the storage locations for each of the frames after rendering the second portion of the display area for that frame. Each of the first and second identifiers may have a different numeric value and determination of whether an imbalance exists may include computing a load coefficient from the numeric values stored in the storage locations. The load coefficient may be, e.g., an average of the recorded numeric values that can be compared to an arithmetic mean of the numeric values of the processor identifiers in order to determine whether an imbalance exists.

In some embodiments, during the act of re-partitioning, an amount by which the size of the first portion of the display area is reduced depends at least in part on a magnitude of the difference between the load coefficient and the arithmetic mean.
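The load-coefficient comparison described above can be sketched as follows. This is a minimal illustration only: the function names, the default identifiers 0 and 1, and the `tolerance` threshold are assumptions introduced here and are not specified in the text.

```python
def load_coefficient(feedback):
    """Average of the processor identifiers recorded as 'last to finish'."""
    return sum(feedback) / len(feedback)


def detect_imbalance(feedback, id_first=0, id_second=1, tolerance=0.1):
    """Return (identifier of the more heavily loaded processor or None, magnitude).

    `tolerance` is a hypothetical dead band around the arithmetic mean of the
    two identifiers; the text does not give a value for it.
    """
    mean = (id_first + id_second) / 2
    deviation = load_coefficient(feedback) - mean
    if abs(deviation) <= tolerance:
        return None, abs(deviation)
    # The recorded identifier belongs to the processor that finished last,
    # so an average above the mean points at the larger identifier.
    if deviation > 0:
        heavier = max(id_first, id_second)
    else:
        heavier = min(id_first, id_second)
    return heavier, abs(deviation)
```

The returned magnitude could then be used to scale how far the partition boundary is moved, in the spirit of the preceding paragraph.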

In some embodiments, the plurality of graphics processors further includes a third graphics processor. During the act of partitioning, the display area may be partitioned into at least three bands including a first band that corresponds to the first portion of the display area, a second band that corresponds to the second portion of the display area, and a third band that corresponds to a third portion of the display area to be rendered by the third graphics processor, wherein the first band is adjacent to the second band and the second band is adjacent to the third band. Additional feedback data may be received for each of the frames, the additional feedback data indicating which of the second and third graphics processors was last to finish rendering the frame. Based on the feedback data, it may be determined whether an imbalance exists between respective loads of the second and third graphics processors. In the event that an imbalance exists, it may be determined which of the second and third graphics processors is more heavily loaded, and the display area may be re-partitioned to decrease a size of the one of the second and third portions of the display area that is rendered by the more heavily loaded one of the second and third graphics processors and to increase a size of the other of the second and third portions of the display area.

According to yet another aspect of the invention, a driver for a graphics processing subsystem having multiple graphics processors includes a command stream generator, an imbalance detecting module, and a partitioning module. The command stream generator is configured to generate a command stream for the graphics processors, the command stream including a set of rendering commands for a frame and an instruction to each of a first one and a second one of the graphics processors to transmit feedback data indicating that the respective processor has executed the set of rendering commands. The imbalance detecting module is configured to receive the feedback data transmitted by the first and second graphics processors and to determine from the feedback data whether an imbalance exists between respective loads of the first and second graphics processors. The partitioning module is configured to partition a display area into a plurality of portions, each portion to be rendered by a different one of the graphics processors, the plurality of portions including a first portion to be rendered by the first graphics processor and a second portion to be rendered by the second graphics processor. The partitioning module is further configured such that, in response to a determination by the imbalance detecting module that an imbalance exists, the partitioning module decreases a size of the one of the first and second portions of the display area that is rendered by the more heavily loaded one of the first and second graphics processors and increases a size of the other of the first and second portions of the display area.

The following detailed description together with the accompanying drawings will provide a better understanding of the nature and advantages of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified block diagram of a computer system according to an embodiment of the present invention;

FIG. 2 is an illustration of a display area showing spatial parallelism according to an embodiment of the present invention;

FIG. 3 is an illustration of a command stream according to an embodiment of the present invention;

FIG. 4 is a flow diagram of a process for providing feedback data from a graphics processing unit according to an embodiment of the present invention;

FIG. 5 is a flow diagram of a process for balancing a load between two graphics processing units according to an embodiment of the present invention;

FIG. 6 is an illustration of a display area showing three-way spatial parallelism according to an embodiment of the present invention;

FIG. 7 is an illustration of a pair of feedback arrays for three-way spatial parallelism according to an embodiment of the present invention;

FIG. 8 is an illustration of a display area showing four-way spatial parallelism according to an embodiment of the present invention;

FIG. 9 is a simplified block diagram of a multi-card graphics processing system according to an embodiment of the present invention; and

FIG. 10 is an illustration of command streams for a multi-card graphics processing system according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention provides systems and methods for balancing a load among multiple graphics processors that render different portions of a frame. In some embodiments, load balancing is performed by determining whether one of two graphics processors finishes rendering a frame last more often than the other. If one of the processors finishes last more often, a portion of the processing burden (e.g., a number of lines of pixels to render) is shifted from that processor to the other processor. The comparison can be repeated and the load adjusted as often as desired. The technique of pairwise load comparisons and balancing can be extended to systems with any number of graphics processors.

FIG. 1 is a block diagram of a computer system 100 according to an embodiment of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via a bus 106. User input is received from one or more user input devices 108 (e.g., keyboard, mouse) coupled to bus 106. Visual output is provided on a pixel based display device 110 (e.g., a conventional CRT or LCD based monitor) operating under control of a graphics processing subsystem 112 coupled to system bus 106. A system disk 128 and other components, such as one or more removable storage devices 129 (e.g., floppy disk drive, compact disk (CD) drive, and/or DVD drive), may also be coupled to system bus 106.

Graphics processing subsystem 112 is advantageously implemented using a printed circuit card adapted to be connected to an appropriate bus slot (e.g., PCI or AGP) on a motherboard of system 100. In this embodiment, graphics processing subsystem 112 includes two (or more) graphics processing units (GPUs) 114 a, 114 b, each of which is advantageously implemented as a separate integrated circuit device (e.g., programmable processor or application-specific integrated circuit (ASIC)). GPUs 114 a, 114 b are configured to perform various rendering functions in response to instructions (commands) received via system bus 106. In some embodiments, the rendering functions correspond to various steps in a graphics processing pipeline by which geometry data describing a scene is transformed to pixel data for displaying on display device 110. These functions can include, for example, lighting transformations, coordinate transformations, scan-conversion of geometric primitives to rasterized data, shading computations, shadow rendering, texture blending, and so on. Numerous implementations of rendering functions are known in the art and may be implemented in GPUs 114 a, 114 b. GPUs 114 a, 114 b are advantageously configured identically so that any graphics processing instruction can be executed by either GPU with substantially identical results.

Each GPU 114 a, 114 b has an associated graphics memory 116 a, 116 b, which may be implemented using one or more integrated-circuit memory devices of generally conventional design. Graphics memories 116 a, 116 b may contain various physical or logical subdivisions, such as display buffers 122 a, 122 b and command buffers 124 a, 124 b. Display buffers 122 a, 122 b store pixel data for an image (or for a part of an image) that is read by scanout control logic 120 and transmitted to display device 110 for display. This pixel data may be generated from scene data provided to GPUs 114 a, 114 b via system bus 106 or generated by various processes executing on CPU 102 and provided to display buffers 122 a, 122 b via system bus 106. In some embodiments, display buffers 122 a, 122 b can be double buffered so that while data for a first image is being read for display from a “front” buffer, data for a second image can be written to a “back” buffer without affecting the currently displayed image. Command buffers 124 a, 124 b are used to queue commands received via system bus 106 for execution by respective GPUs 114 a, 114 b, as described below. Other portions of graphics memories 116 a, 116 b may be used to store data required by respective GPUs 114 a, 114 b (such as texture data, color lookup tables, etc.), executable program code for GPUs 114 a, 114 b, and so on.

For each graphics memory 116 a, 116 b, a memory interface 123 a, 123 b is also provided for controlling access to the respective graphics memory. Memory interfaces 123 a, 123 b can be integrated with respective GPUs 114 a, 114 b or with respective memories 116 a, 116 b, or they can be implemented as separate integrated circuit devices. In one embodiment, all memory access requests originating from GPU 114 a are sent to memory interface 123 a. If the target address of the request corresponds to a location in memory 116 a, memory interface 123 a accesses the appropriate location; if not, then memory interface 123 a forwards the request to a bridge unit 130, which is described below. Memory interface 123 a also receives all memory access requests targeting locations in memory 116 a; these requests may originate from scanout control logic 120, CPU 102, or other system components, as well as from GPU 114 a or 114 b. Similarly, memory interface 123 b receives all memory access requests that originate from GPU 114 b or that target locations in memory 116 b.

Bridge unit 130 is configured to manage communication between components of graphics processing subsystem 112 (including memory interfaces 123 a, 123 b) and other components of system 100. For example, bridge unit 130 may receive all incoming data transfer requests from system bus 106 and distribute (or broadcast) the requests to one or more of memory interfaces 123 a, 123 b. Bridge unit 130 may also receive data transfer requests originating from components of graphics processing subsystem 112 (such as GPUs 114 a, 114 b) that reference memory locations external to graphics processing subsystem 112 and transmit these requests via system bus 106. In addition, in some embodiments, bridge unit 130 facilitates access by either of GPUs 114 a, 114 b to the memory 116 b, 116 a associated with the other of GPUs 114 a, 114 b. Examples of implementations of bridge unit 130 are described in detail in the above-referenced co-pending application Ser. No. 10/643,072; a detailed description is omitted herein as not being critical to understanding the present invention.

In operation, a graphics driver program (or other program) executing on CPU 102 delivers rendering commands and associated data for processing by GPUs 114 a, 114 b. In some embodiments, CPU 102 communicates asynchronously with each of GPUs 114 a, 114 b using a command buffer, which may be implemented in any memory accessible to both the CPU 102 and the GPUs 114 a, 114 b. In one embodiment, the command buffer is stored in system memory 104 and is accessible to GPUs 114 a, 114 b via direct memory access (DMA) transfers. In another embodiment, each GPU 114 a, 114 b has a respective command buffer 124 a, 124 b in its memory 116 a, 116 b; these command buffers are accessible to CPU 102 via DMA transfers. The command buffer stores a number of rendering commands and sets of rendering data. In one embodiment, a rendering command may be associated with rendering data, with the rendering command defining a set of rendering operations to be performed by the GPU on the associated rendering data. In some embodiments, the rendering data is stored in the command buffer adjacent to the associated rendering command.

CPU 102 writes a command stream including rendering commands and data sets to the command buffer for each GPU 114 a, 114 b (e.g., command buffers 124 a, 124 b). In some embodiments, the same rendering commands and data are written to each GPU's command buffer (e.g., using a broadcast mode of bridge unit 130); in other embodiments, CPU 102 writes to each GPU's command buffer separately. Where the same command stream is provided to both GPUs 114 a, 114 b, the command stream may include tags or other parameters to indicate which of the GPUs should process a particular command.

Each command buffer 124 a, 124 b is advantageously implemented as a first-in, first-out buffer (FIFO) that is written by CPU 102 and read by the respective one of GPUs 114 a, 114 b; reading and writing can occur asynchronously. In one embodiment, CPU 102 periodically writes new commands and data to each command buffer at a location determined by a “put” pointer, which CPU 102 increments after each write. Asynchronously, each of GPUs 114 a, 114 b continuously reads and processes commands and data sets previously stored in its command buffer 124 a, 124 b; each GPU 114 a, 114 b maintains a “get” pointer to identify the read location in its command buffer 124 a, 124 b, and the get pointer is incremented after each read. Provided that CPU 102 stays sufficiently far ahead of GPUs 114 a, 114 b, the GPUs are able to render images without incurring idle time waiting for CPU 102. In some embodiments, depending on the size of the command buffer and the complexity of a scene, CPU 102 may write commands and data sets for frames several frames ahead of a frame being rendered by GPUs 114 a, 114 b.

The command buffer may be of fixed size (e.g., 5 megabytes) and may be written and read in a wraparound fashion (e.g., after writing to the last location, CPU 102 may reset the “put” pointer to the first location). A more detailed description of embodiments of command buffers and techniques for writing commands and data to command buffers in a multi-chip graphics processing system is provided in the above-referenced co-pending application Ser. No. 10/639,893.
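The put/get pointer scheme with wraparound can be illustrated with a small ring buffer. This is a sketch under stated assumptions, not the implementation described in the referenced application: the class name `CommandRing` is hypothetical, and the convention of keeping one slot empty to tell a full buffer from an empty one is a common choice that the text does not specify.

```python
class CommandRing:
    """Fixed-size FIFO written at a 'put' pointer and read at a 'get' pointer."""

    def __init__(self, size):
        self.slots = [None] * size
        self.put = 0  # next location the writer (CPU) fills
        self.get = 0  # next location the reader (GPU) consumes

    def write(self, command):
        """Store a command; return False if the reader has not freed a slot."""
        next_put = (self.put + 1) % len(self.slots)  # wrap to the first location
        if next_put == self.get:
            return False  # assumed policy: one slot stays empty when full
        self.slots[self.put] = command
        self.put = next_put
        return True

    def read(self):
        """Return the oldest unread command, or None if the buffer is empty."""
        if self.get == self.put:
            return None  # reader has caught up with the writer
        command = self.slots[self.get]
        self.get = (self.get + 1) % len(self.slots)
        return command
```

A `read` that returns None corresponds to the situation the text warns about, in which a GPU sits idle because the CPU has not stayed far enough ahead.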

Scanout control logic 120 reads pixel data for an image from display buffers 122 a, 122 b and transfers the data to display device 110 to be displayed. Scanout can occur at a constant refresh rate (e.g., 80 Hz); the refresh rate can be a user selectable parameter and need not correspond to the rate at which new frames of image data are written to display buffers 122 a, 122 b. Scanout control logic 120 may also perform other operations such as adjustment of color values, generating composite screen images by combining the pixel data in either of the display buffers 122 a, 122 b with data for a video or cursor overlay image or the like obtained from either of graphics memories 116 a, 116 b or another data source (not shown), digital to analog conversion, and so on.

GPUs 114 a, 114 b are advantageously operated in parallel to increase the rate at which new frames of image data can be rendered. In one embodiment, referred to herein as “spatial parallelism,” each GPU 114 a, 114 b generates pixel data for a different portion (e.g., a horizontal or vertical band) of each frame; scanout control logic 120 reads a first portion (e.g., the top portion) of the pixel data for a frame from display buffer 122 a and a second portion (e.g., the bottom portion) from display buffer 122 b. For spatial parallelism, rendering commands and accompanying data may be written in parallel to both command buffers 124 a, 124 b (e.g., using a broadcast mode of bridge unit 130), but commands and/or data can also be selectively written to one or more of the command buffers (e.g., different parameters for a command that defines the viewable area might be written to the different command buffers so that each GPU renders the correct portion of the image).

An example of spatial parallelism is shown in FIG. 2. A display area 200 consists of M lines (horizontal rows) of pixel data. Lines 1 through P (corresponding to top portion 202 of display area 200) are rendered by GPU 114 a of FIG. 1, while lines P+1 through M (corresponding to bottom portion 204 of display area 200) are rendered by GPU 114 b. In this embodiment, each GPU 114 a, 114 b allocates a display buffer 122 a, 122 b in its local memory 116 a, 116 b that is large enough to store an entire frame (M lines) of data but only fills the lines it renders (lines 1 through P for GPU 114 a and lines P+1 through M for GPU 114 b). During each display refresh cycle, scanout control logic 120 reads the first P lines from display buffer 122 a, then switches to display buffer 122 b to read lines P+1 through M. To determine which lines each GPU renders, a “clip rectangle” is set for each GPU; for example, GPU 114 a may have a clip rectangle corresponding to top portion 202 of frame 200 while GPU 114 b has a clip rectangle corresponding to bottom portion 204 of frame 200.
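The split at line P and the corresponding scanout order can be sketched as below. The function names and the representation of a display buffer as a Python list of M lines are assumptions made for illustration; the text numbers lines from 1, while the list indices here start at 0.

```python
def clip_rectangles(m, p):
    """Return 1-based inclusive line ranges for the top and bottom portions."""
    if not 1 <= p < m:
        raise ValueError("P must satisfy 1 <= P < M")
    return (1, p), (p + 1, m)


def scanout(buffer_a, buffer_b, p):
    """Assemble one displayed frame from two full-size display buffers.

    Lines 1..P come from the first buffer and lines P+1..M from the second;
    each buffer is only expected to hold valid data for its own lines.
    """
    if len(buffer_a) != len(buffer_b):
        raise ValueError("both display buffers must hold M lines")
    return buffer_a[:p] + buffer_b[p:]
```

For example, with M = 5 and P = 2, the first two lines of the displayed frame are taken from the first buffer and the remaining three from the second.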

In accordance with an embodiment of the present invention, each GPU provides feedback data to the graphics driver program (or another program executing on CPU 102). The feedback data provides information about the time taken by a particular GPU to render its portion of the image. The graphics driver program uses this feedback to dynamically balance the load among the GPUs by modifying the clip rectangle from time to time, e.g., by changing the dividing line to a different line P′, based on the relative loads on the two GPUs.

An example of a command stream 300 that may be written to either (or both) of command buffers 124 a, 124 b is shown in FIG. 3. The stream starts with a “clip rectangle” (CR) command 302, which defines the viewable area of the image. For example, the clip rectangle for GPU 114 a may be defined to include lines 1 through P of display area 200 (FIG. 2), while the clip rectangle for GPU 114 b includes lines P+1 through M. As used herein, the term “clip rectangle” is to be understood as including any particular command or terminology associated with defining the visible portion of the image plane for a frame or image, or more specifically, the portion of the image plane that a particular GPU is instructed to render.

The clip rectangle command is followed by one or more rendering commands 304 and associated rendering data for a frame F0. These commands and data may include, for instance, definitions of primitives and/or objects making up the scene, coordinate transformations, lighting transformations, shading commands, texture commands, and any other type of rendering commands and/or data, typically culminating in the writing of pixel data to display buffers 122 a, 122 b (and reading of that data by scanout control logic 120).

Following the last rendering command 304 for frame F0 is a “write notifier” (WN) command 306. The write notifier command instructs the GPU to write feedback data to system memory indicating that it has finished the frame F0. This feedback data can be read by the graphics driver program and used to balance the load among the GPUs. Specific embodiments of feedback data are described below.

Write notifier command 306 is followed by rendering commands 308 and associated rendering data for the next frame F1, which in turn are followed by another write notifier command 310, and so on. After some number (Q) of frames, there is a write notifier command 322 followed by a new clip rectangle command 324. At this point, the clip rectangles for each GPU may be modified by the graphics driver program based on the feedback data received in response to the various write notifier commands (e.g., commands 306, 310). For example, where the display area is divided as shown in FIG. 2, the value of P may be modified (e.g., to P′) in response to feedback data: if the GPU that processes top portion 202 tends to finish its frames first, the value of P is increased, and if the GPU that processes bottom portion 204 tends to finish first, the value of P is decreased. Specific embodiments of re-partitioning a display area in response to feedback data are described below.
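The layout of the command stream and the direction in which P moves can be sketched as follows. The tuple encoding of commands, the function names, and the fixed `step` parameter are hypothetical choices made for illustration; the text states only the ordering of commands and the direction of the adjustment.

```python
def build_command_stream(clip, num_frames):
    """Return one clip rectangle command followed by per-frame commands.

    Each frame contributes a placeholder rendering command ("RCMD") and a
    write notifier ("WN"), mirroring the CR, rendering, WN ordering above.
    """
    first_line, last_line = clip
    stream = [("CR", first_line, last_line)]
    for frame in range(num_frames):
        stream.append(("RCMD", frame))  # stands in for all rendering commands
        stream.append(("WN", frame))    # feedback write after the frame
    return stream


def adjust_dividing_line(p, m, top_finishes_first, step=1):
    """Move the dividing line P toward the GPU that tends to finish last.

    If the top GPU tends to finish first, P grows so the top portion gets
    more lines; otherwise P shrinks. The result is kept within 1..M-1 so
    that both portions stay non-empty (an assumption of this sketch).
    """
    new_p = p + step if top_finishes_first else p - step
    return max(1, min(m - 1, new_p))
```

After Q frames, a driver following this sketch would emit a new stream whose clip rectangle reflects the adjusted value of P.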

It will be appreciated that the system described herein is illustrative and that variations and modifications are possible. For instance, while two GPUs, with respective memories, are shown, any number of GPUs can be used, and multiple GPUs might share a memory. The memory interfaces described herein may be integrated with a GPU and/or a memory in a single integrated circuit device (chip) or implemented as separate chips. The bridge unit may be integrated with any of the memory interface and/or GPU chips, or may be implemented on a separate chip. The various memories can be implemented using one or more integrated circuit devices. Graphics processing subsystems can be implemented using various expansion card formats, including PCI, PCIX (PCI Express), AGP (Accelerated Graphics Port), and so on. Some or all of the components of a graphics processing subsystem may be mounted directly on a motherboard; for instance, one of the GPUs can be a motherboard-mounted graphics co-processor. Computer systems suitable for practicing the present invention may also include various other components, such as high-speed DMA (direct memory access) chips, and a single system may implement multiple bus protocols (e.g., PCI and AGP buses may both be present) with appropriate components provided for interconnecting the buses. One or more command buffers may be implemented in the main system memory rather than graphics subsystem memory, and commands may include an additional parameter indicating which GPU(s) is (are) to receive or process the command. While the present description may refer to asynchronous operation, those skilled in the art will recognize that the invention may also be implemented in systems where the CPU communicates synchronously with the GPUs.

Embodiments of feedback data and load balancing techniques based on the feedback data will now be described. In one embodiment, each GPU 114 a, 114 b is assigned an identifier that it stores in a designated location in its local memory 116 a, 116 b; the identifier may also be stored in an on-chip register of each GPU 114 a, 114 b. For example, GPU 114 a can be assigned an identifier “0” while GPU 114 b is assigned an identifier “1.” These identifiers, which advantageously have numerical values, may be assigned, e.g., at system startup or application startup. As described below, the identifier may be used as feedback data for purposes of load balancing.

FIG. 4 illustrates a process 400 for recording feedback data including the identifiers of the GPUs. At step 402, the graphics driver program creates a feedback array (referred to herein as feedback[0:B-1]) of dimension B (e.g., 5, 10, 20, 50, etc.) in system main memory, and at step 404, a frame counter k is initialized (e.g., to zero). In this embodiment, the write notifier command following each frame k instructs the GPU to copy its identifier from its local memory to the location feedback[k] in system main memory, e.g., using a DMA block transfer operation (“Blit”) or any other operation by which a GPU can write data to system main memory. Thus, at step 406, the first GPU to finish rendering frame k writes its identifier to the array location feedback[k]. At step 408, the second GPU to finish rendering the frame k writes its identifier to the array location feedback[k], overwriting the first GPU's identifier. It is to be understood that either GPU 114 a, 114 b might finish first, and that a GPU that is first to finish one frame might be last to finish another frame.

It should be noted that in this embodiment each GPU is instructed to write to the same location in system memory; as a result, the second GPU to finish frame k overwrites the identifier of the first GPU in array element feedback[k]. Thus, after both GPUs have finished a particular frame k, the value stored in feedback[k] indicates which GPU was last to finish the frame k.

At step 410, the frame counter is incremented to the next frame, modulo B. This causes the feedback array to be overwritten in a circular fashion every B frames, so that the contents of the array generally reflect the last B frames that have been rendered. In one embodiment, the frame counter value for each frame is provided with the write notification command to each GPU; in another embodiment, each GPU maintains its own frame counter and updates the frame counter after writing the identifier to the appropriate location in system memory in response to the write notifier command.
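The recording scheme of process 400 can be illustrated with a minimal software sketch. This is only a simulation under stated assumptions, not driver or GPU code: the function name record_feedback, the tuple finish_order (listing identifiers in the order the GPUs finish a frame), and the value B=4 are all hypothetical and chosen for illustration.

```python
def record_feedback(feedback, frame_counter, finish_order):
    """Simulate steps 406-410 for one frame.

    Each GPU writes its identifier to feedback[frame_counter] when it
    finishes, so a later write overwrites an earlier one and the slot ends
    up holding the identifier of the GPU that finished last.
    Returns the frame counter advanced modulo the array dimension B.
    """
    for gpu_id in finish_order:
        feedback[frame_counter] = gpu_id
    return (frame_counter + 1) % len(feedback)


B = 4                 # array dimension (illustrative value)
feedback = [0] * B    # step 402: create the feedback array
k = 0                 # step 404: initialize the frame counter

# Hypothetical finish orders for five frames: first element finishes first.
for order in [(0, 1), (1, 0), (0, 1), (0, 1), (1, 0)]:
    k = record_feedback(feedback, k, order)

print(feedback, k)
```

Because five frames are recorded into a four-entry array, the fifth frame reuses slot 0, which shows the circular overwrite described above.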

The information in the feedback array can be used by a graphics driver program (or another program executing on CPU 102) for load balancing, as illustrated in FIG. 5. Process 500 is shown as a continuous loop in which the relative load on the GPUs is estimated from time to time by averaging values stored in the feedback array and the load is adjusted based on the estimate. In this embodiment, there are two GPUs (e.g., GPUs 114 a, 114 b of FIG. 1) operating in spatial parallelism and the display area is divided as shown in FIG. 2. The GPU assigned to the top portion 202 of the display area has identifier “0” and is referred to herein as GPU-0, and the GPU assigned to the bottom portion 204 has identifier “1” and is referred to herein as GPU-1. Load balancing is done by adjusting the clip rectangle for each GPU, determined in this example by the location of the boundary line P in FIG. 2.

At step 501, a clip rectangle command is issued (e.g., placed in the command stream) for each GPU. This initial clip rectangle command may partition the display area equally between the GPUs (e.g., using P=M/2) or unequally. For example, a developer of an application program may empirically determine a value of P that approximately balances the load and provide that value to the graphics driver program via an appropriate command. The initial size of the portion of the display area allocated to each GPU is not critical, as the sizes will typically be changed from time to time to balance the load.

At step 502, the graphics driver determines whether it is time to balance the load between the GPUs. Various criteria may be used in this determination; for example, the graphics driver may balance the load after some number (Q) of frames, where Q might be, e.g., 1, 2, 5, 10, 20, etc. Q advantageously does not exceed the number of entries B in the feedback array, but Q need not be equal to B. Alternatively, load balancing may be performed at regular time intervals (e.g., once per second) or according to any other criteria. If it is not time to balance the load, process 500 waits (step 504), then checks the load balancing criteria again at step 502.

When it is time to balance the load, the graphics driver averages Q values from the feedback array at step 506, thereby computing a load coefficient. In one embodiment Q is equal to B (the length of the feedback array), but other values may be chosen. It should be noted that the graphics driver and the GPUs may operate asynchronously with the CPU as described above, so that the graphics driver might not know whether the GPUs have finished a particular frame and the GPUs may be rendering a frame that is several frames earlier in the command stream than a current frame in the graphics driver. Where the feedback array is written in a circular fashion, as in process 400 described above, selecting Q to be equal to B provides an average over the B most recently rendered frames. In some embodiments, a weighted average may be used, e.g., giving a larger weight to more recently-rendered frames.

The load coefficient is used to determine whether an adjustment to the clip rectangles for the GPUs needs to be made. If the GPUs are equally loaded, the likelihood of either GPU finishing a frame first is about 50%, and the average value over a suitable number of frames (e.g., 20) will be about 0.5 if identifier values of 0 and 1 are used. An average value in excess of 0.5 indicates that GPU-1 (which renders the bottom portion of the image) is more heavily loaded than GPU-0, and an average value below 0.5 indicates that GPU-0 (which renders the top portion of the image) is more heavily loaded than GPU-1.
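As a rough illustration of step 506, the load coefficient can be computed as a plain or weighted average of the recorded identifiers. The sketch below is hypothetical: the function name load_coefficient and the particular weights are assumptions, and the entries are assumed to have been arranged oldest first before weighting.

```python
def load_coefficient(entries, weights=None):
    """Average the identifier values recorded for Q frames.

    entries: identifiers (e.g., 0 or 1) of the last GPU to finish each frame.
    weights: optional per-entry weights, same length as entries; larger
             weights on later entries favor more recently rendered frames.
    """
    if not entries:
        raise ValueError("at least one feedback entry is required")
    if weights is None:
        return sum(entries) / len(entries)
    if len(weights) != len(entries):
        raise ValueError("weights and entries must have the same length")
    return sum(w * v for w, v in zip(weights, entries)) / sum(weights)


# GPU-1 was last to finish in three of four frames: coefficient above 0.5.
plain = load_coefficient([0, 1, 1, 1])

# GPU-1 was last only in the two oldest frames; recency weighting pulls
# the coefficient below the unweighted value of 0.5.
weighted = load_coefficient([1, 1, 0, 0], weights=[1, 2, 3, 4])

print(plain, weighted)
```

With identifiers 0 and 1, a result above 0.5 suggests GPU-1 is the more heavily loaded processor, and a result below 0.5 suggests GPU-0 is.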

Accordingly, at step 510 it is determined whether the load coefficient exceeds a “high” threshold. The high threshold is preselected and may be exactly 0.5 or a somewhat higher value (e.g., 0.55 or 0.6). If the load coefficient exceeds the high threshold, then the loads are adjusted at step 512 by moving the boundary line P in FIG. 2 down by a preset amount (e.g., one line, five lines, ten lines). This reduces the fraction of the display area that is rendered by GPU-1, which will tend to reduce the load on GPU-1 and increase the load on GPU-0. Otherwise, at step 514, it is determined whether the load coefficient is less than a “low” threshold. The low threshold is predefined and may be exactly 0.5 or a somewhat lower value (e.g., 0.45 or 0.4). If the load coefficient is below the low threshold, then the loads are adjusted at step 516 by moving the boundary line P in FIG. 2 up by a preset amount (e.g., one line, five lines, ten lines). At step 518, if the load coefficient is neither above the high threshold nor below the low threshold, the load is considered balanced, and the boundary line P is left unchanged.
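The threshold comparison of steps 510 through 518 can be sketched as follows. This is a minimal illustration rather than the driver's actual logic: the function name adjust_boundary and the default threshold and step values are assumptions, and P is assumed to be a line number that grows toward the bottom of the display area, so that moving the boundary down means increasing P.

```python
def adjust_boundary(p, coeff, high=0.55, low=0.45, step=5):
    """Return the new boundary line P for a two-GPU split.

    coeff above `high`: GPU-1 (bottom) is more heavily loaded, so the
    boundary moves down (P increases) and the bottom portion shrinks.
    coeff below `low`: GPU-0 (top) is more heavily loaded, so the
    boundary moves up (P decreases).
    Otherwise the load is treated as balanced and P is unchanged.
    """
    if high < low:
        raise ValueError("high threshold must not be less than low threshold")
    if coeff > high:
        return p + step
    if coeff < low:
        return p - step
    return p


print(adjust_boundary(300, 0.70))  # boundary moves down
print(adjust_boundary(300, 0.30))  # boundary moves up
print(adjust_boundary(300, 0.50))  # balanced, unchanged
```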

After the new boundary line P is determined, a new clip rectangle command is issued for each GPU (step 522) and the process returns to step 504 to wait until it is time to balance the load again. In an alternative embodiment, a new clip rectangle command is issued at step 522 only if the boundary line changes. In conjunction with the new clip rectangle command, a message may be sent to the scanout control logic so that the appropriate display buffer is selected to provide each line of pixel data (e.g., by modifying one or more scanout parameters related to selection of display buffers). Changes in the parameters of the scanout control logic are advantageously synchronized with rendering of the frame in which the new clip rectangle takes effect; accordingly, in some embodiments, the clip rectangle command may also update the scanout parameters in order to display the next rendered frame correctly.

In some embodiments, when the boundary line is shifted to balance the load, it may be useful to transfer data from one display buffer to another. For example, in FIG. 2, suppose that just after GPUs 114 a, 114 b have finished rendering a current frame, the value of P is changed to a larger value P′, increasing the number of lines that GPU 114 a will render for the next frame. GPU 114 a may need access to data for some or all of lines P+1 through P′ of the current frame in order to correctly process the next frame. In one embodiment, GPU 114 a can obtain the data by a DMA transfer from the portion of display buffer 122 b that has the data for lines P+1 through P′. Examples of processes that can advantageously be used for this purpose are described in the above-referenced application Ser. No. 10/643,072, although numerous other processes for transferring data may also be used. It is to be understood that transferring data between display buffers is not required but may be useful in embodiments where any overhead associated with the data transfer is outweighed by the overhead of having one GPU repeat computations previously performed by another GPU. Transferring data that is not displayed (e.g., texture data) between graphics memories 116 a, 116 b may also be desirable in some instances and can be implemented using any of the techniques mentioned above.

It will be appreciated that the processes described herein are illustrative and that variations and modifications are possible. Steps described as sequential may be executed in parallel, order of steps may be varied, and steps may be modified or combined. Optimal selection of the number of frames to average (Q) and/or the frequency of balancing generally depends on various tradeoffs. For instance, a small value of Q provides faster reactions to changes in the scene being rendered, while a larger value of Q will tend to produce more stable results (by minimizing the effect of fluctuations) as well as reducing any effect of an entry in the feedback array for a frame that only one GPU has finished (such an entry would not accurately reflect the last GPU to finish that frame). More frequent balancing may reduce GPU idle time, while less frequent balancing tends to reduce any overhead (such as data transfers between the memories of different GPUs) associated with changing clip rectangles. In one embodiment, checking the balance every 20 frames with Q=B=20 is effective, but in general, optimal values depend on various implementation details. It should be noted that checking the balance can occur quite frequently; e.g., if 30 frames are rendered per second and checking occurs every 20 frames, then the balance may change about every 0.67 seconds.

The identifiers for different GPUs may have any value. Correspondingly, the high threshold and low threshold may have any values, and the two threshold values may be equal (e.g., both equal to 0.5), so long as the high threshold is not less than the low threshold. Both thresholds are advantageously set to values near or equal to the arithmetic mean of the two identifiers; an optimal selection of thresholds in a particular system may be affected by considerations such as the frequency of load rebalancing and any overhead associated with changing the clip rectangles assigned to each GPU. The threshold comparison is advantageously defined such that there is some condition for which the load is considered balanced (e.g., if the average is exactly equal to the arithmetic mean).

Prior to rendering images or writing any feedback data, the feedback array may be initialized, e.g., by randomly selecting either of the GPU identifiers for each entry or by filling alternating entries with different identifiers. Such initialization reduces the likelihood of a spurious imbalance being detected in the event that checking the load balance occurs before the GPUs have written values to all of the entries that are being used to determine the load coefficient.
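A small sketch of the alternating initialization follows. The function name init_feedback is hypothetical, and identifiers 0 and 1 are assumed as in the two-GPU embodiment described above.

```python
def init_feedback(size, id_a=0, id_b=1):
    """Fill a feedback array with alternating GPU identifiers.

    For an even `size`, the average of the initial contents equals the
    arithmetic mean of the two identifiers, so a balance check that runs
    before any GPU has written feedback sees a neutral load coefficient.
    """
    return [id_a if i % 2 == 0 else id_b for i in range(size)]


initial = init_feedback(20)
print(sum(initial) / len(initial))
```

For an odd array size the initial average is slightly off the mean, which the threshold margins would normally absorb.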

In one alternative embodiment, the amount by which the partition changes (e.g., the number of lines by which the boundary line P is shifted) may depend on the magnitude of the difference between the load coefficient and the arithmetic mean. For example, if the load coefficient is greater than 0.5 but less than 0.6, a downward shift of four lines might be used, while for a load coefficient greater than 0.6, a shift of eight lines might be used; similar shifts in the opposite direction can be implemented for load coefficients below the arithmetic mean. In some embodiments, the difference in size of the two clip rectangles is limited to ensure that each GPU is always rendering at least a minimum portion (e.g., 10% or 25%) of the display area.
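The magnitude-dependent shift and the minimum-portion limit might be combined as in the sketch below. The function names are hypothetical, and two details are assumptions not stated in the text: a coefficient of exactly 0.6 (or 0.4) is treated as the larger shift, and the minimum portion is expressed as an integer percentage of the total line count.

```python
def shift_for(coeff):
    """Signed number of lines to move boundary P (positive means down)."""
    if coeff >= 0.6:
        return 8
    if coeff > 0.5:
        return 4
    if coeff <= 0.4:
        return -8
    if coeff < 0.5:
        return -4
    return 0  # exactly at the arithmetic mean: balanced


def clamp_boundary(p, total_lines, min_percent=10):
    """Keep P so that each GPU renders at least min_percent of the lines."""
    min_lines = total_lines * min_percent // 100
    return max(min_lines, min(p, total_lines - min_lines))


# Example: a 600-line display area, boundary already near the bottom.
p = clamp_boundary(538 + shift_for(0.7), total_lines=600)
print(p)
```

In this example the eight-line shift would push P to 546, but the clamp holds it at 540 so the bottom GPU keeps at least 60 lines.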

Instead of averaging, a load coefficient may be defined in other ways. For instance, the sum of the recorded identifier values may be used as the load coefficient. In the embodiment described above, with Q=20, the stored identifier values (0 or 1) would sum to 10 if the load is balanced; high and low thresholds may be set accordingly. Other arithmetic operations that may be substituted for those described herein will also be apparent to those of ordinary skill in the art and are within the scope of the present invention.

In another alternative embodiment, different feedback data may be used instead of or in addition to the GPU identifiers described above. For example, instead of providing one feedback array in system memory, with both GPUs writing feedback data to the same location for a given frame, each GPU may write to a corresponding entry of a different feedback array, and the feedback data may include timing information, e.g., a timestamp indicating when each GPU finished a particular frame. In this embodiment, the graphics driver is configured to use the timing information to determine whether one GPU is consistently using more time per frame than another and adjust the clip rectangles accordingly to balance the load. It should be noted that, in some system implementations, timestamps might not accurately reflect the performance of the GPUs; in addition, determining relative loads from sequences of timestamps for each GPU generally requires more computational steps than simply computing a load coefficient as described above. Nevertheless, it is to be understood that embodiments of the invention may include timing information in the feedback data instead of or in addition to GPU identifiers.
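One possible way to use per-GPU timestamps is sketched below. The text does not specify how the timing information is compared, so everything here is an assumption: the function name mean_finish_lag is hypothetical, both timestamp arrays are assumed to use the same clock and the same frame indexing, and the two GPUs are assumed to begin each frame at about the same time.

```python
def mean_finish_lag(finish_gpu0, finish_gpu1):
    """Average of (GPU-1 finish time minus GPU-0 finish time) per frame.

    A consistently positive result suggests GPU-1 takes longer per frame;
    a consistently negative result suggests GPU-0 does.
    """
    if len(finish_gpu0) != len(finish_gpu1) or not finish_gpu0:
        raise ValueError("need equally sized, non-empty timestamp arrays")
    lags = [t1 - t0 for t0, t1 in zip(finish_gpu0, finish_gpu1)]
    return sum(lags) / len(lags)


# Hypothetical finish timestamps (arbitrary time units) for three frames.
lag = mean_finish_lag([10, 20, 30], [12, 23, 31])
print(lag)
```

A driver could compare such a lag against a tolerance and move the boundary in the same direction the identifier-based load coefficient would indicate.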

Multi-processor graphics processing systems may include more than two GPUs, and processes 400 and 500 may be adapted for use in such systems. For example, one embodiment of the present invention provides three GPUs, with each GPU being assigned a different horizontal band of the display area, as shown in FIG. 6. An M-line display area 600 is partitioned into a top portion 602 that includes lines 1 through K, a middle portion 604 that includes lines K+1 through L, and a bottom portion 606 that includes lines L+1 through M. Data for top portion 602 is generated by a GPU 614 a having an identifier value of “0” (referred to herein as GPU-0); data for middle portion 604 is generated by a GPU 614 b having an identifier value of “1” (referred to herein as GPU-1); and data for bottom portion 606 is generated by a GPU 614 c having an identifier value of “2” (referred to herein as GPU-2). Load balancing is achieved by adjusting the values of K and L.

More specifically, in one embodiment, the command stream for each GPU is similar to that of FIG. 3, but two feedback arrays of dimension B (referred to herein as feedback01[0:B-1] and feedback12[0:B-1]) are provided, as shown in FIG. 7. In response to the write notifier command 306, GPU-0 writes its identifier value to a location in the feedback01 array 702 (writing is indicated by arrows in FIG. 7), GPU-1 writes its identifier value to respective locations in both the feedback01 and feedback12 arrays 702, 704, and GPU-2 writes its identifier value to a location in the feedback12 array 704. As a result, an average value of the feedback01 array reflects the relative loads on GPU-0 and GPU-1, while an average value of the feedback12 array reflects the relative loads on GPU-1 and GPU-2.

To balance the loads, the graphics driver adjusts the value of K based on a load coefficient determined from the feedback01 array, e.g., in accordance with process 500 of FIG. 5 described above (with balance occurring when the load coefficient is 0.5), and adjusts the value of L based on a load coefficient determined from the feedback12 array, e.g., in accordance with process 500 (with balance occurring when the load coefficient is 1.5). While the relative loads of GPU-0 and GPU-2 are not directly compared, over time all three loads will tend to become approximately equal. For example, if the load on GPU-1 exceeds the load on GPU-0, the average value of entries in the feedback01 array will exceed 0.5; as a result the value of K will be increased, thereby reducing the load on GPU-1. If the reduced load on GPU-1 becomes less than the load on GPU-2, this disparity will be reflected in the average value of entries in the feedback12 array, which will exceed 1.5; in response, the value of L will be increased, thereby increasing the load on GPU-1 again. This change may lead to a further adjustment in the value of K, and so on. Those of skill in the art will appreciate that over time, this load-balancing process will tend to equalize all three loads. Some instability may persist, but this is acceptable as long as any overhead associated with modifying the clip rectangles in response to new values of K and/or L is sufficiently small.
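The pairwise adjustment of K and L can be sketched as two independent applications of the two-GPU rule, one per feedback array. The function names, the margin around each balance point, and the step size are assumptions made for illustration; a fuller version would also keep K below L and enforce a minimum band size.

```python
def direction(coeff, center, margin):
    """+1 if coeff is above center+margin, -1 if below center-margin, else 0."""
    if coeff > center + margin:
        return 1
    if coeff < center - margin:
        return -1
    return 0


def rebalance_three(k, l, feedback01, feedback12, step=4, margin=0.05):
    """Adjust boundaries K and L for three GPUs with identifiers 0, 1, 2.

    feedback01 holds 0s and 1s (balance point 0.5) and drives K;
    feedback12 holds 1s and 2s (balance point 1.5) and drives L.
    Increasing a boundary shrinks the band below it.
    """
    c01 = sum(feedback01) / len(feedback01)
    c12 = sum(feedback12) / len(feedback12)
    new_k = k + step * direction(c01, 0.5, margin)
    new_l = l + step * direction(c12, 1.5, margin)
    return new_k, new_l


# GPU-1 usually finishes after GPU-0 and after GPU-2: its band shrinks
# from both sides (K increases, L decreases).
print(rebalance_three(200, 400, [1, 1, 1, 0], [1, 1, 2, 1]))
```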

It will be appreciated that this load-balancing technique may be further extended to systems with any number of GPUs. For instance, the display area can be divided into any number of horizontal bands, with each band being assigned to a different GPU. In such embodiments, the number of feedback arrays is generally one less than the number of GPUs. Alternatively, vertical bands may be used.

It should also be noted that the identifier of a particular GPU need not be unique across all GPUs, as long as the two GPUs that write to each feedback array have identifiers that are different from each other. For example, in the embodiment shown in FIG. 6, GPUs 614 a and 614 c might both be assigned identifier “0.” This would not create ambiguity because, as FIG. 7 shows, these GPUs do not write their identifiers to the same feedback array.

In another alternative embodiment, a combination of horizontal and vertical partitions of the display area may be used to assign portions of the display area to GPUs. For example, FIG. 8 shows a display area 800 consisting of M lines, each containing N pixels, that is divided into four sections 801-804. Sections 801-804 are rendered, respectively, by four GPUs 814 a-814 d as indicated by arrows. Each GPU 814 a-814 d is assigned a different identifier value (0, 1, 2, 3). In this embodiment, it may be assumed that complexity of an image is generally about equal between the left and right sides, in which case the vertical boundary line J may remain fixed (e.g., at J=N/2). Two feedback arrays are provided; GPU-0 (814 a) and GPU-1 (814 b) write their identifiers to a first feedback array feedback01 while GPU-2 (814 c) and GPU-3 (814 d) write their identifiers to a second feedback array feedback23. The boundary line K that divides sections 801 and 802 is adjusted based on the average value of entries in the feedback01 array, while the boundary line L that divides sections 803 and 804 is adjusted based on the average value of entries in the feedback23 array.

In yet another alternative embodiment, the vertical boundary line J might also be adjustable. For instance, GPU-0 and GPU-1 could each be assigned a secondary (column) identifier value of “0” while GPU-2 and GPU-3 are each assigned a secondary identifier with a value of “1.” A third feedback array feedbackC may be provided, with each GPU writing its secondary identifier to the feedbackC array in addition to writing its primary identifier to the appropriate one of the feedback01 and feedback23 arrays. The vertical boundary line J can then be adjusted based on the average value of entries in the feedbackC array. Alternatively, the primary identifier (which has values 0-3) can be associated with the vertical division while the secondary identifier (which has values 0 and 1) is associated with the horizontal division.

The techniques described herein may also be employed in a “multi-card” graphics processing subsystem in which different GPUs reside on different expansion cards connected by a high-speed bus, such as a PCIX (64-bit PCI Express) bus or a 3GIO (third-generation input/output) bus presently being developed. An example of a multi-card system 900 is shown in FIG. 9. Two graphics cards 912 a, 912 b are interconnected by a high-speed bus 908; it is to be understood that any number of cards may be included and that high-speed bus 908 generally also connects to other elements of a computer system (e.g., various components of system 100 as shown in FIG. 1). Each graphics card has a respective GPU 914 a, 914 b and a respective graphics memory 916 a, 916 b that includes a display buffer 922 a, 922 b. Card 912 a has scanout control logic 920 that provides pixel data from display buffer 922 a to a display device 910. Card 912 b may also include scanout control logic circuitry, but in this example, card 912 b is not connected to a display device and any scanout control logic present in card 912 b may be disabled.

In this arrangement, spatial parallelism can be implemented, with each GPU 914 a, 914 b rendering a portion of each frame to its display buffer 922 a, 922 b. In order to display the frame, pixel data from display buffer 922 b is transferred (e.g., using a conventional block transfer, or Blit, operation) via bus 908 to display buffer 922 a, from which it is read by scanout control logic 920.

Load balancing as described above can be implemented in this system and advantageously takes into consideration time consumed by the data transfers. For example, FIG. 10 shows respective command streams 1000 a, 1000 b for GPUs 914 a, 914 b, which are generally similar to command stream 300 of FIG. 3. Each command stream begins with a clip rectangle command (CR) 1002 a, 1002 b, followed by rendering commands 1004 a, 1004 b for a frame F0. As in the single-card embodiments described above, different clip rectangle boundaries are provided for each GPU 914 a, 914 b so that each renders a different portion of the frame; the rendering commands to each GPU may be identical or different as appropriate for a particular embodiment.

In this embodiment, pixel data from display buffer 922 b is transferred to display buffer 922 a prior to scanout. Accordingly, for GPU 914 b, the rendering commands 1004 b are followed by a Blit command 1006 that instructs GPU 914 b to transfer pixel data from local display buffer 922 b to display buffer 922 a on card 912 a so that it can be scanned out. Since GPU 914 a writes pixel data directly to display buffer 922 a, a Blit command is not required in command stream 1000 a, so the rendering commands 1004 a for GPU 914 a are followed by a “no-op” 1005. The no-op may be, e.g., a command that simply delays execution of a following command (such commands are known in the art), no command, or a command instructing GPU 914 a to ignore a Blit command that appears in its command stream.

A write notifier command 1008 a for frame F0 follows the no-op command 1005 in command stream 1000 a, and a corresponding write notifier command 1008 b follows Blit command 1006. The write notifier commands 1008 a, 1008 b may be implemented similarly to the write notifier commands described above with reference to process 400 of FIG. 4. A load balancing process such as process 500 of FIG. 5 may be used to balance the load.

It should be noted that the time required for the Blit operations is accounted for in the load balancing process because the write notifier command 1008 b for a frame F0 is not executed by GPU 914 b until after the Blit operation for the frame F0 is executed. Thus, the rendering time for GPU 914 a is balanced against the rendering time plus the Blit time for GPU 914 b.
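The effect of placing the write notifier after the Blit can be illustrated with a toy timing model. This sketch rests on several assumptions that are not in the text: the function name last_finisher is hypothetical, GPU 914 a is given identifier 0 and GPU 914 b identifier 1, both GPUs are assumed to start the frame together, and a tie is arbitrarily resolved in favor of identifier 0.

```python
def last_finisher(render_a, render_b, blit_b, id_a=0, id_b=1):
    """Identifier that would remain in the feedback entry for one frame.

    GPU 914 a reaches its write notifier after rendering alone, whereas
    GPU 914 b reaches its write notifier only after rendering and the Blit.
    """
    finish_a = render_a
    finish_b = render_b + blit_b
    return id_b if finish_b > finish_a else id_a


# GPU 914 b renders faster, but the Blit makes it the last to finish.
print(last_finisher(render_a=10.0, render_b=9.0, blit_b=2.0))
# With a smaller region, rendering plus Blit finishes before GPU 914 a.
print(last_finisher(render_a=10.0, render_b=7.0, blit_b=2.0))
```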

In some multi-card embodiments used to render scenes in which foreground regions (most often but not always at the bottom of the display area) are consistently more complex than background regions, a performance advantage can be gained by assigning GPU 914 a to process the background region of the scene and assigning GPU 914 b to process the foreground region. For example, in FIG. 2, suppose that the foreground appears toward the bottom of display area 200. In that case, GPU 914 a would be assigned to render top region 202 while GPU 914 b would be assigned to render bottom region 204. The higher complexity of the foreground (bottom) region tends to increase the rendering time of GPU 914 b. In response, the load-balancing processes described herein will tend to move the boundary line P toward the bottom of the display area. This decreases the number of lines of data included in bottom region 204, which reduces the amount of data that needs to be transferred to display buffer 922 a by the Blit command 1006. As a result, more of the processing capacity of GPU 914 b may be used for computations rather than data transfers, resulting in a net efficiency gain.

Those of ordinary skill in the art will recognize that a similar implementation might also be used in embodiments of a single-card multi-processor system in which pixel data from all GPUs is transferred to a single display buffer prior to scanout. For example, in system 112 of FIG. 1, data from display buffer 122 b might be transferred to display buffer 122 a to be scanned out, so that scanout control logic 120 can simply access display buffer 122 a to obtain all of the pixel data for a frame. In this embodiment, GPU 114 b can be instructed to perform a Blit operation before the write notifier instruction, while GPU 114 a is given a no-op.

While the invention has been described with respect to specific embodiments, one skilled in the art will recognize that numerous modifications are possible. For instance, in a multi-processor graphics processing system, any number of GPUs may be included on a graphics card, and any number of cards may be provided; e.g., a four-GPU subsystem might be implemented using two cards with two GPUs each, or a three-GPU subsystem might include a first card with one GPU and a second card with two GPUs. One or more of the GPUs may be a motherboard-mounted graphics co-processor.

Rendering of a display frame may be divided among the GPUs in horizontal bands and/or vertical bands. Those of skill in the art will recognize that use of vertical bands may result in more uniform sizes of the regions rendered by different GPUs (since image complexity usually varies less from left to right than from top to bottom), while use of horizontal bands may simplify the scanout operation in a horizontal row-oriented display device (since only one GPU's display buffer would be accessed to read a particular row of pixels). In addition, a frame may be partitioned among the GPUs along both horizontal and vertical boundaries, and load balancing may be performed along either or both boundaries as described above.

Embodiments of the invention may be implemented using special-purpose hardware, software executing on general-purpose or special-purpose processors, or any combination thereof. The embodiments have been described in terms of functional blocks that might or might not correspond to separate integrated circuit devices in a particular implementation. Although the present disclosure may refer to a general-purpose computing system, those of ordinary skill in the art with access to the present disclosure will recognize that the invention may be employed in a variety of other embodiments, including special-purpose computing systems such as video game consoles or any other computing system that provides graphics processing capability with multiple graphics processors.

Computer programs embodying various features of the present invention may be encoded on computer-readable media for storage and/or transmission; suitable media include magnetic disk or tape, optical storage media such as compact disk (CD) or DVD (digital video disk), flash memory, and carrier signals for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. Computer-readable media encoded with the program code may be packaged with a compatible device such as a multi-processor graphics card or provided separately from other devices (e.g., via Internet download).

Thus, although the invention has been described with respect to specific embodiments, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

1-27. (canceled)
 28. A method for load balancing in a graphics processing system, the method comprising: assigning a portion of a rendering process for a frame to be performed by each of a plurality of graphics processors in the graphics processing system; instructing the graphics processors to perform the rendering process, wherein each graphics processor performs the portion of the rendering process assigned thereto; instructing the graphics processors to provide feedback data reflecting respective rendering times for each graphics processor; and for at least a first pair of graphics processors selected from the plurality of graphics processors: determining, based on the feedback data provided by the first pair of graphics processors, whether an imbalance exists between respective loads of the first pair of graphics processors; and in the event that an imbalance exists, shifting a subset of the portion of the rendering process assigned to a more heavily loaded one of the first pair of graphics processors to the less heavily loaded one of the first pair of graphics processors.
 29. The method of claim 28 wherein the portion of the rendering process assigned to each of the plurality of graphics processors includes generating pixel data for a subset of pixels of the frame.
 30. The method of claim 28 further comprising: assigning a first processor identifier to one of the first pair of graphics processors; and assigning a second processor identifier to the other of the first pair of graphics processors, wherein the feedback data received from the graphics processors in the first pair of graphics processors includes the processor identifiers.
 31. The method of claim 28 wherein the feedback data received from the first pair of graphics processors includes data indicating which one of the first pair of graphics processors is last to finish the portion of the rendering task assigned thereto.
 32. The method of claim 31 wherein the feedback data from the one of the first pair of graphics processors that is last to finish overwrites feedback data from the other one of the first pair of processors.
 33. The method of claim 28 further comprising, for a second pair of graphics processors selected from the plurality of graphics processors: determining, based on the feedback data provided by the second pair of graphics processors, whether an imbalance exists between respective loads of the second pair of graphics processors; and in the event that an imbalance exists, shifting a subset of the portion of the rendering process assigned to a more heavily loaded one of the second pair of graphics processors to the less heavily loaded one of the second pair of graphics processors.
 34. The method of claim 33 wherein one of the graphics processors in the first pair of graphics processors also belongs to the second pair of graphics processors.
 35. The method of claim 33 wherein neither of the graphics processors in the first pair of graphics processors belongs to the second pair of graphics processors.
 36. The method of claim 28 further comprising: instructing one graphics processor of the first pair of graphics processors to transfer result data from the portion of the rendering process assigned thereto to another one of the plurality of graphics processors.
 37. The method of claim 36 wherein the other one of the plurality of graphics processors is the other one of the first pair of graphics processors.
 38. The method of claim 36 wherein the act of instructing one graphics processor of the first pair of graphics processors to transfer the result data is performed subsequently to the act of instructing the plurality of graphics processors to perform the rendering task and prior to the act of instructing the plurality of graphics processors to transmit the feedback data.
 39. The method of claim 28, further comprising: defining a plurality of storage locations, each storage location associated with a different one of a plurality of frames, wherein the act of instructing the graphics processors to provide feedback data includes: instructing a first one of the first pair of graphics processors to store, after completion of the portion of the rendering process assigned thereto for one of the plurality of frames, a first processor identifier in the one of the storage locations associated with the one of the frames; and instructing a second one of the first pair of graphics processors to store, after completion of the portion of the rendering process assigned thereto for the one of the frames, a second processor identifier in the one of the storage locations associated with the one of the frames, wherein the processor identifier written by the first one of the first pair of graphics processors that was last to finish the portion of the rendering process assigned thereto overwrites the processor identifier of the second one of the first pair of graphics processors. 